Update Netherlands Recent corpus definition to read from Parlamint v4 data, including Named Entities #1815

BeritJanssen · 2025-06-11T14:38:12Z

The current branch uses the parlamint_v4.py utils file from #1730. I changed it in the following ways:

- added a function to read people and party metadata from the external files (included in the test data)
- added various functions for named entity fields, and parsers to enrich with named entity annotations
- added functions to flatten the annotated format of the speeches
- removed functions that were copied without change from the parliament.utils.parlamint.py utils

lukavdplas

The code looks fine! I added some requests for clarity, but other than that, feel free to merge.

lukavdplas · 2025-06-16T12:09:49Z

backend/corpora/parliament/utils/parlamint_v4.py

+"""
+This file was created as an updated utils file for the ParlaMint dataset, version 4.0. The previous utils file
+is based on version 2.0.
+"""


Small note: module-level docstrings should be the first statement in the file. If you put one after the imports, it's not registered by `help()`, code editors, etc.
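A quick way to see the difference, as a minimal sketch: the two module sources below are made up for illustration and are not taken from this PR. `ast.get_docstring` reports what Python would register as the module docstring.

```python
import ast

# Hypothetical module sources, used only to illustrate docstring placement.
DOCSTRING_FIRST = '''"""Utilities for ParlaMint 4.0 data."""
import os
'''

DOCSTRING_AFTER_IMPORTS = '''import os
"""Utilities for ParlaMint 4.0 data."""
'''


def module_docstring(source: str):
    """Return the docstring Python registers for this module source, or None."""
    return ast.get_docstring(ast.parse(source))


# The string is only treated as a docstring when it is the first statement.
print(module_docstring(DOCSTRING_FIRST))
print(module_docstring(DOCSTRING_AFTER_IMPORTS))
```

In the second case the string is just an expression statement that is evaluated and discarded, so tools that read `__doc__` find nothing.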

I think by "the previous utils file", you're referring to parlamint.py? This reference is ambiguous. Also, someone not familiar with the code history would not know which file came first chronologically.

If I were looking over a directory with parlamint.py and parlamint_v4.py, I would probably assume that parlamint.py covers the most recent version and should be the default for new corpora, and the v4 module is for compatibility with some older version.

lukavdplas · 2025-06-16T12:19:52Z

backend/corpora/parliament/utils/parlamint_v4.py

+    else:
+        return False
+
+def transform_current_party_id(data):


This could do with a docstring. What is data? What is it transformed into?

Based on the name, I assumed this function transformed the current party ID into something else, but based on the code, it looks like this function retrieves the current party ID from other data? If so, the name is a bit misleading.
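To make the request concrete, here is a rough sketch of the kind of name and docstring that would answer both questions. Everything in it is an assumption for illustration: the name `get_current_party_id`, the shape of `data`, and the `affiliations`, `party_id` and `end` keys are hypothetical and do not reflect the actual code in this PR.

```python
from typing import Optional


def get_current_party_id(data: dict) -> Optional[str]:
    """Return the ID of the party the speaker currently belongs to.

    Parameters:
        data: hypothetical speaker metadata, assumed here to contain an
            'affiliations' list of dicts with a 'party_id' and an 'end'
            date (None while the affiliation is ongoing).

    Returns:
        The 'party_id' of the first ongoing affiliation, or None if the
        speaker has no ongoing affiliation.
    """
    for affiliation in data.get('affiliations', []):
        if affiliation.get('end') is None:
            return affiliation.get('party_id')
    return None


# Usage with made-up data
speaker = {
    'affiliations': [
        {'party_id': 'party.A', 'end': '2019-05-01'},
        {'party_id': 'party.B', 'end': None},
    ]
}
print(get_current_party_id(speaker))
```

A "get" or "extract" prefix signals that the value is looked up from the input rather than converted into something else.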

Meesch · 2025-06-18T10:48:59Z

backend/addcorpus/es_mappings.py

-def annotated_text_mapping():
-    return {'type': 'annotated_text'}
+def ner_mapping():
+    return {'type': 'text', 'index': False}


What was the issue with the 'annotated_text' type?

See #1724

TLDR: it wasn't necessary because we already have the unannotated text for full-text search, and keyword fields for finding entities.
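As a rough sketch of how the three kinds of fields fit together: only `ner_mapping` is taken from the diff above. The `text_mapping` and `keyword_mapping` helpers and the field names `speech`, `speech_ner` and `ner_person` are assumptions made for this example, not the actual corpus definition.

```python
def text_mapping():
    """Analyzed text field, used for full-text search (assumed helper)."""
    return {'type': 'text'}


def keyword_mapping():
    """Exact-match field, used for filtering on entities (assumed helper)."""
    return {'type': 'keyword'}


def ner_mapping():
    """Annotated text is stored but not indexed, as in the diff above."""
    return {'type': 'text', 'index': False}


# Hypothetical field names, for illustration only.
mappings = {
    'properties': {
        'speech': text_mapping(),        # unannotated text: searchable
        'speech_ner': ner_mapping(),     # annotated text: display only
        'ner_person': keyword_mapping(), # entity values: filterable
    }
}

print(mappings['properties']['speech_ner'])
```

With `'index': False` the annotated text is still returned in the document source but cannot be searched, which is enough if searching is handled by the other two fields.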

BeritJanssen added 4 commits June 11, 2025 13:57

add parlamint_v4 utils

d489384

fix: use plain text mapping for ner field

3c5f42d

Merge branch 'develop' into feature/parlamint-ner

c6776d2

fix: read NL recent from Parlamint v4

c6c94b2

BeritJanssen requested a review from Meesch June 11, 2025 14:38

BeritJanssen changed the title ~~Update Netherlands Recent corpus definition to read from Parlamint v4 data~~ Update Netherlands Recent corpus definition to read from Parlamint v4 data, including Named Entities Jun 11, 2025

lukavdplas approved these changes Jun 16, 2025

Meesch reviewed Jun 18, 2025

BeritJanssen merged commit d0e76da into develop Jul 14, 2025
1 check passed

BeritJanssen deleted the feature/parlamint-ner branch July 14, 2025 13:49
